Abstract

Low-rank matrix approximations are often used to help scale standard 
machine learning algorithms to large-scale problems. Recently, matrix 
coherence has been used to characterize the ability to extract global in- 
formation from a subset of matrix entries in the context of these low-rank 
approximations and other sampling-based algorithms, e.g., matrix com- 
pletion, robust PCA. Since coherence is defined in terms of the singular 
vectors of a matrix and is expensive to compute, the practical significance 
of these results largely hinges on the following question: Can we effi- 
ciently and accurately estimate the coherence of a matrix? In this paper 
we address this question. We propose a novel algorithm for estimating 
coherence from a small number of columns, formally analyze its behavior, 
and derive a new coherence-based matrix approximation bound based on 
this analysis. We then present extensive experimental results on synthetic 
and real datasets that corroborate our worst-case theoretical analysis, yet 
provide strong support for the use of our proposed algorithm whenever 
low-rank approximation is being considered. Our algorithm efficiently and 
accurately estimates matrix coherence across a wide range of datasets, and 
these coherence estimates are excellent predictors of the effectiveness of 
sampling-based matrix approximation on a case-by-case basis. 



1 Introduction 

Large-scale datasets are becoming more and more prevalent for problems in a 
variety of areas, e.g., computer vision, natural language processing, computa- 
tional biology. However, several standard methods in machine learning, such 
as spectral clustering, manifold learning techniques, kernel ridge regression or 
other kernel-based algorithms do not scale to such orders of magnitude. For 
large datasets, these algorithms would require storage and operation on
matrices with thousands to millions of columns and rows, which is especially
problematic since these matrices are often not sparse. An attractive solution to such
problems involves efficiently generating low-rank approximations to the original 
matrix of interest. In particular, sampling-based techniques that operate on a 
subset of the columns of the matrix can be effective solutions to this problem, 
and have been widely studied within the machine learning and theoretical
computer science communities (Drineas et al., 2006; Frieze et al., 1998; Kumar et
al., 2009b; Williams and Seeger, 2000). In the context of kernel matrices, the
Nyström method (Williams and Seeger, 2000) has been shown to work particularly
well in practice for various applications ranging from manifold learning to
image segmentation (Fowlkes et al., 2004; Talwalkar et al., 2008).

A crucial assumption of these algorithms involves their sampling-based
nature, namely that an accurate low-rank approximation of some matrix
$\mathbf{X} \in \mathbb{R}^{n \times m}$ can be generated exclusively from
information extracted from a small subset ($l \ll m$) of its columns. This
assumption is not generally true for all matrices, and explains the negative
results of Fergus et al. (2009). For instance, consider the extreme case:



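$$\mathbf{X} = \begin{bmatrix} \mathbf{e}_1 & \cdots & \mathbf{e}_r & \mathbf{0} & \cdots & \mathbf{0} \end{bmatrix} \qquad (1)$$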



where $\mathbf{e}_i$ is the $i$th column of the $n$ dimensional identity matrix
and $\mathbf{0}$ is the $n$ dimensional zero vector. Although this matrix has
rank $r$, it cannot be well approximated by a random subset of $l$ columns unless
this subset includes $\mathbf{e}_1, \ldots, \mathbf{e}_r$. In order to account for
such pathological cases, previous theoretical bounds relied on sampling columns
of $\mathbf{X}$ in an adaptive fashion (Bach and Jordan, 2005; Deshpande et al.,
2006; Kumar et al., 2009b; Smola and Schölkopf, 2000) or from non-uniform
distributions derived from properties of $\mathbf{X}$ (Drineas and Mahoney, 2005;
Drineas et al., 2006). Indeed, these bounds give better guarantees for
pathological cases, but are often quite loose nonetheless, e.g., when dealing
with kernel matrices using RBF kernels, and these sampling schemes are rarely
utilized in practice.
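
The extreme case in (1) can be checked numerically. The following is a minimal sketch, assuming NumPy and hypothetical helper names (`pathological_matrix`, `projection_error`) that are not part of the paper: it projects the matrix onto the span of a uniformly sampled subset of columns and reports the relative Frobenius error.

```python
import numpy as np

def pathological_matrix(n, m, r):
    """Build X = [e_1 ... e_r 0 ... 0] (rank r, n x m); assumes r <= min(n, m)."""
    X = np.zeros((n, m))
    X[np.arange(r), np.arange(r)] = 1.0
    return X

def projection_error(X, cols):
    """Relative Frobenius error after projecting X onto the span of X[:, cols]."""
    U, s, _ = np.linalg.svd(X[:, cols], full_matrices=False)
    U = U[:, s > 1e-12]                      # basis of the sampled column space
    return np.linalg.norm(X - U @ (U.T @ X)) / np.linalg.norm(X)

rng = np.random.default_rng(0)
X = pathological_matrix(n=200, m=200, r=5)
cols = rng.choice(200, size=20, replace=False)   # uniform sample of l = 20 columns
print(projection_error(X, cols))                 # stays large unless cols hits 0..4
print(projection_error(X, np.arange(5)))         # exact once e_1..e_r are sampled
```

A sample that misses all of the first $r$ columns carries no information about the matrix, which is the behavior the adaptive and non-uniform schemes cited above are meant to avoid.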

More recently, Talwalkar and Rostamizadeh (2010) used the notion of
coherence to characterize the ability to extract information from a small subset
of columns, showing theoretical and empirical evidence that coherence is tied
to the performance of the Nyström method. Coherence measures the extent to
which the singular vectors of a matrix are correlated with the standard basis.
Intuitively, if the dominant singular vectors of a matrix are incoherent, then the
subspace spanned by these singular vectors is likely to be captured by a random
subset of sampled columns of the matrix. In fact, coherence-based analysis of
algorithms has been an active field of research, starting with pioneering work
on compressed sensing (Candès et al., 2006; Donoho, 2006), as well as related
work on matrix completion (Candès and Recht, 2009; Keshavan et al., 2009b)
and robust principal component analysis (Candès et al., 2009).
In Candès and Recht (2009), the use of coherence is motivated by results
showing that several classes of randomly generated matrices have low coher- 
ence with high probability, one of which is the class of matrices generated from 
uniform random orthonormal singular vectors and arbitrary singular values. 
Unfortunately, these results do not help a practitioner compute coherence on a 
case-by-case basis to determine whether attractive theoretical bounds hold for 
the task at hand. Furthermore, the coherence of a matrix is by definition derived 
from its singular vectors and is thus expensive to compute (the prohibitive cost 
of calculating singular values and singular vectors is precisely the motivation 
behind sampling-based techniques). Hence, in spite of the extensive theoretical
work based on related notions of coherence, the practical significance of these
results largely hinges on the following open question: Can we efficiently and 
accurately estimate the coherence of a matrix? 

In this paper we address this question by presenting a novel algorithm for 
estimating matrix coherence from a small number of columns. The remainder 
of this paper is organized as follows. Section 2 introduces basic definitions,
and provides a brief background on low-rank matrix approximation and matrix
coherence. In Section 3 we introduce our sampling-based algorithm to estimate
matrix coherence. We then formally analyze its behavior in Section 4, and also
use this analysis to derive a novel coherence-based bound for matrix projection
reconstruction via Column-sampling (defined in Section 2.2). Finally, in
Section 5 we present extensive experimental results on synthetic and real datasets.
These results corroborate our worst-case theoretical analysis, yet provide strong 
support for the use of our proposed algorithm whenever sampling-based matrix 
approximation is being considered. Empirically, our algorithm effectively esti- 
mates matrix coherence across a wide range of datasets, and these coherence 
estimates are excellent predictors of the effectiveness of sampling-based matrix 
approximation on a case-by-case basis. 

2 Background 
2.1 Notation 

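Let $\mathbf{X} \in \mathbb{R}^{n \times m}$ be an arbitrary matrix. We define
$\mathbf{X}^{(j)}$, $j = 1 \ldots m$, as the $j$th column vector of $\mathbf{X}$,
$\mathbf{X}_{(i)}$, $i = 1 \ldots n$, as the $i$th row vector of $\mathbf{X}$ and
$\mathbf{X}_{ij}$ as the $ij$th entry of $\mathbf{X}$. We denote by
$\|\mathbf{X}\|_F$ the Frobenius norm of $\mathbf{X}$ and by $\|\mathbf{v}\|$
the $\ell_2$ norm of the vector $\mathbf{v}$. If rank($\mathbf{X}$) $= r$, we can
write the thin Singular Value Decomposition (SVD) as
$\mathbf{X} = \mathbf{U}_X \mathbf{\Sigma}_X \mathbf{V}_X^\top$. $\mathbf{\Sigma}_X$
is diagonal and contains the singular values of $\mathbf{X}$ sorted in decreasing
order, i.e., $\sigma_1(\mathbf{X}) \ge \sigma_2(\mathbf{X}) \ge \ldots \ge \sigma_r(\mathbf{X})$.
$\mathbf{U}_X \in \mathbb{R}^{n \times r}$ and $\mathbf{V}_X \in \mathbb{R}^{m \times r}$
have orthonormal columns that contain the left and right singular vectors of
$\mathbf{X}$ corresponding to its singular values. We define
$\mathbf{P}_X = \mathbf{U}_X \mathbf{U}_X^\top$ as the orthogonal projection matrix
onto the column space of $\mathbf{X}$, and denote the projection onto its
orthogonal complement as $\mathbf{P}_{X,\perp} = \mathbf{I} - \mathbf{P}_X$. We
further define $\mathbf{X}^+ \in \mathbb{R}^{m \times n}$ as the Moore-Penrose
pseudoinverse of $\mathbf{X}$, with
$\mathbf{X}^+ = \mathbf{V}_X \mathbf{\Sigma}_X^+ \mathbf{U}_X^\top$. Finally, we
will define $\mathbf{K} \in \mathbb{R}^{n \times n}$ as a symmetric positive
semidefinite (SPSD) matrix with rank($\mathbf{K}$) $= r < n$, i.e., a symmetric
matrix with non-negative eigenvalues.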



2.2 Low-rank matrix approximation 

Starting with an $n \times m$ matrix, $\mathbf{X}$, we are interested in
algorithms that generate a low-rank approximation, $\tilde{\mathbf{X}}$, from a
sample of $l \ll n$ of its columns. The accuracy of this approximation is often
measured using the Frobenius or Spectral distance, i.e.,
$\|\mathbf{X} - \tilde{\mathbf{X}}\|_{\{2,F\}}$. We next briefly describe two of
the most common algorithms of this form, the Column-sampling and the Nyström
methods.

The Column-sampling method generates approximations to arbitrary rectangular
matrices. We first sample $l$ columns of $\mathbf{X}$ such that
$\mathbf{X} = [\mathbf{X}_1 \; \mathbf{X}_2]$, where $\mathbf{X}_1$ has $l$
columns, and then use the SVD of $\mathbf{X}_1$,
$\mathbf{X}_1 = \mathbf{U}_{X_1} \mathbf{\Sigma}_{X_1} \mathbf{V}_{X_1}^\top$, to
approximate the SVD of $\mathbf{X}$ (Frieze et al., 1998). This method is most
commonly used to generate a 'matrix projection' approximation (Kumar et al.,
2009b) of $\mathbf{X}$ as follows:

$$\tilde{\mathbf{X}}^{col} = \mathbf{U}_{X_1} \mathbf{U}_{X_1}^\top \mathbf{X}. \qquad (2)$$

The runtime of the Column-sampling method is dominated by the SVD of
$\mathbf{X}_1$, which takes $O(nl^2)$ time to perform and is feasible for small $l$.
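
The matrix projection approximation in (2) can be sketched in a few lines. This is an illustrative sketch, assuming NumPy, uniform sampling without replacement, and a hypothetical relative cutoff of `1e-10` for discarding numerically null singular directions; the function name is not from the paper.

```python
import numpy as np

def column_sampling_projection(X, l, rng):
    """Matrix projection approximation U_{X_1} U_{X_1}^T X from l sampled columns."""
    cols = rng.choice(X.shape[1], size=l, replace=False)
    # SVD of the n x l submatrix X_1 dominates the cost: O(n l^2).
    U, s, _ = np.linalg.svd(X[:, cols], full_matrices=False)
    U = U[:, s > s.max() * 1e-10]            # keep directions spanning X_1
    return U @ (U.T @ X)
```

Computing `U.T @ X` before multiplying by `U` avoids forming the $n \times n$ projection matrix explicitly.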

In contrast to the Column-sampling method, the Nyström method deals only
with SPSD matrices. We start with an $n \times n$ SPSD matrix, sampling $l$ columns
such that $\mathbf{K} = [\mathbf{K}_1 \; \mathbf{K}_2]$, where $\mathbf{K}_1$ has
$l$ columns, and define $\mathbf{W}$ as the $l \times l$ matrix consisting of the
intersection of these $l$ columns with the corresponding $l$ rows of $\mathbf{K}$.
Since $\mathbf{K}$ is SPSD, $\mathbf{W}$ is also SPSD. Without loss of generality,
we can rearrange the columns and rows of $\mathbf{K}$ based on this sampling such
that:



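$$\mathbf{K} = \begin{bmatrix} \mathbf{W} & \mathbf{K}_{21}^\top \\ \mathbf{K}_{21} & \mathbf{K}_{22} \end{bmatrix}, \quad \text{where } \mathbf{K}_1 = \begin{bmatrix} \mathbf{W} \\ \mathbf{K}_{21} \end{bmatrix} \text{ and } \mathbf{K}_2 = \begin{bmatrix} \mathbf{K}_{21}^\top \\ \mathbf{K}_{22} \end{bmatrix}. \qquad (3)$$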



The Nyström method uses $\mathbf{W}$ and $\mathbf{K}_1$ from (3) to generate a
'spectral reconstruction' (Kumar et al., 2009b) approximation of $\mathbf{K}$ as
$\tilde{\mathbf{K}}^{nys} = \mathbf{K}_1 \mathbf{W}^+ \mathbf{K}_1^\top$. Since the
running time complexity of SVD on $\mathbf{W}$ is in $O(l^3)$ and matrix
multiplication with $\mathbf{K}_1$ takes $O(l^2 n)$, the total complexity of the
Nyström approximation computation is also in $O(l^2 n)$.
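
As an illustration of the spectral reconstruction, the sketch below forms $\mathbf{K}_1 \mathbf{W}^+ \mathbf{K}_1^\top$ with NumPy. The function name, the uniform sampling, and the pseudoinverse cutoff `rcond=1e-10` are assumptions made for this example rather than choices stated in the text.

```python
import numpy as np

def nystrom(K, l, rng):
    """Nystrom spectral reconstruction K_1 W^+ K_1^T from l sampled columns of SPSD K."""
    cols = rng.choice(K.shape[0], size=l, replace=False)
    K1 = K[:, cols]                          # n x l sampled columns
    W = K1[cols, :]                          # l x l intersection block
    # W is symmetric, so the Hermitian pseudoinverse applies; rcond drops
    # eigenvalues that are zero up to round-off.
    W_pinv = np.linalg.pinv(W, rcond=1e-10, hermitian=True)
    return K1 @ W_pinv @ K1.T
```

When rank($\mathbf{W}$) equals rank($\mathbf{K}$), this reconstruction reproduces $\mathbf{K}$ up to round-off, which makes a convenient sanity check.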



2.3 Matrix Coherence 

Matrix coherence measures the extent to which the singular vectors of a matrix 
are correlated with the standard basis. As previously mentioned, coherence 
has been used to analyze techniques such as compressed sensing, matrix completion,
robust PCA, and the Nyström method. These analyses have used a variety of
related notions of coherence. If we let $\mathbf{e}_i$ be the $i$th column of the standard
basis, we can define three basic notions of coherence as follows: 
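Definition 1 ($\mu$-Coherence). Let $\mathbf{U} \in \mathbb{R}^{n \times r}$ contain
orthonormal columns with $r < n$. Then the $\mu$-coherence of $\mathbf{U}$ is:

$$\mu(\mathbf{U}) = \sqrt{n} \max_{i,j} |\mathbf{U}_{ij}|. \qquad (4)$$

Definition 2 ($\mu_0$-Coherence). Let $\mathbf{U} \in \mathbb{R}^{n \times r}$ contain
orthonormal columns with $r < n$ and define
$\mathbf{P}_U = \mathbf{U}\mathbf{U}^\top$ as its associated orthogonal projection
matrix. Then the $\mu_0$-coherence of $\mathbf{U}$ is:

$$\mu_0(\mathbf{U}) = \frac{n}{r} \max_{1 \le i \le n} \|\mathbf{P}_U \mathbf{e}_i\|^2 = \frac{n}{r} \max_{1 \le i \le n} \|\mathbf{U}_{(i)}\|^2. \qquad (5)$$

Definition 3 ($\mu_1$-Coherence). Given the matrix
$\mathbf{X} \in \mathbb{R}^{n \times m}$ with rank $r$, left and right singular
vectors, $\mathbf{U}_X$ and $\mathbf{V}_X$, and define
$\mathbf{T} = \sum_{1 \le k \le r} \mathbf{U}_X^{(k)} {\mathbf{V}_X^{(k)}}^\top$.
Then, the $\mu_1$-coherence of $\mathbf{X}$ is:

$$\mu_1(\mathbf{X}) = \sqrt{\frac{nm}{r}} \max_{i,j} |\mathbf{T}_{ij}|. \qquad (6)$$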

In Talwalkar and Rostamizadeh (2010), $\mu(\mathbf{U})$ is used to provide
coherence-based bounds for the Nyström method, where $\mathbf{U}$ corresponds to
the singular vectors of a low-rank SPSD kernel matrix. Low-rank matrices are also
the focus of work on matrix completion by Candès and Recht (2009) and Keshavan
et al. (2009b), though they deal with more general rectangular matrices with
SVD $\mathbf{X} = \mathbf{U}_X \mathbf{\Sigma}_X \mathbf{V}_X^\top$, and they use
$\mu_0(\mathbf{U}_X)$, $\mu_0(\mathbf{V}_X)$ and $\mu_1(\mathbf{X})$ to bound
the performance of two different matrix completion algorithms. Note that a
stronger, more complex notion of coherence is used in Candès and Tao (2009) to
provide tighter bounds for the matrix completion algorithm presented in Candès
and Recht (2009) (definition omitted here). Moreover, coherence has also been
used to analyze algorithms dealing with low-rank matrices in the presence of
noise, e.g., Candès and Plan (2009); Keshavan et al. (2009a) for noisy matrix
completion and Candès et al. (2009) for robust PCA. In these analyses, the
coherence of the underlying low-rank matrix once again appears in the form of
$\mu_0(\cdot)$ and $\mu_1(\cdot)$.

In this work we choose to focus on $\mu_0$. In comparison to $\mu$, $\mu_0$ is a
more robust measure of coherence, as it deals with row norms of $\mathbf{U}$,
rather than the maximum entry of $\mathbf{U}$, and the two notions are related by
a simple pair of inequalities: $\mu^2 / r \le \mu_0 \le \mu^2$. Furthermore, since
we focus on coherence in the context of algorithms that sample columns of the
original matrix, $\mu_0$ is a more natural choice than $\mu_1$, since existing
coherence-based bounds for these algorithms (both in Talwalkar and Rostamizadeh
(2010) and in Section 4 of this work) only depend on the left singular vectors of
the matrix.
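
The three notions of coherence are straightforward to compute once the singular vectors are available. The following sketch, assuming NumPy and hypothetical function names `mu`, `mu0` and `mu1`, mirrors Definitions 1 to 3 and can be used to check the inequalities $\mu^2 / r \le \mu_0 \le \mu^2$ on a given orthonormal basis.

```python
import numpy as np

def mu(U):
    """mu-coherence: sqrt(n) times the largest absolute entry of U."""
    return np.sqrt(U.shape[0]) * np.abs(U).max()

def mu0(U):
    """mu_0-coherence: (n / r) times the largest squared row norm of U."""
    n, r = U.shape
    return (n / r) * (U ** 2).sum(axis=1).max()

def mu1(U, V):
    """mu_1-coherence: sqrt(n m / r) times the largest absolute entry of U V^T."""
    n, m, r = U.shape[0], V.shape[0], U.shape[1]
    return np.sqrt(n * m / r) * np.abs(U @ V.T).max()

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 4)))   # random orthonormal basis
print(mu(U) ** 2 / 4, mu0(U), mu(U) ** 2)            # the three values are ordered
```

The inequalities follow because each squared row norm of $\mathbf{U}$ is at least its largest squared entry and at most $r$ times that entry.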



3 Estimate-Coherence Algorithm 

As discussed in the previous section, matrix coherence has been used to analyze 
a variety of algorithms, under the assumption that the input matrix is either 
exactly low-rank or is low-rank with the presence of noise. In this section, we
present a novel algorithm to estimate the coherence of matrices following the
same assumption. Starting with an arbitrary $n \times m$ matrix, $\mathbf{X}$, we are
ultimately interested in an estimate of $\mu_0(\mathbf{U}_X)$, which contains the
scaling factor $n/r$ as shown in Definition 2. However, our estimate will also
involve singular vectors in dimension $n$, and as we mentioned above, $r$ is
assumed to be small. Hence, neither of these scaling terms has a significant
impact on our estimation. As such, our algorithm focuses on the closely related
expression:

$$\gamma(\mathbf{U}) = \max_{1 \le i \le n} \|\mathbf{P}_U \mathbf{e}_i\|^2 = \frac{r}{n} \mu_0(\mathbf{U}). \qquad (7)$$

Our proposed algorithm is quite similar in flavor to the Column-sampling
algorithm discussed in Section 2.2. It estimates coherence by first sampling $l$
columns of the matrix and subsequently using the left singular vectors of this
submatrix to obtain an estimate. Note that our algorithm applies both to exact
low-rank matrices as well as low-rank matrices perturbed by noise. In the latter
case, the algorithm requires a user-defined low-rank parameter, $r$. The runtime
of this algorithm is dominated by the singular value decomposition of the
$n \times l$ submatrix, and hence is in $O(l^2 n)$. The details of the
Estimate-Coherence algorithm are presented in Figure 1.

Input: $n \times l$ matrix ($\mathbf{X}_1$) storing $l$ columns of arbitrary
$n \times m$ matrix $\mathbf{X}$, low-rank parameter ($r$)

Output: An estimate of the coherence of $\mathbf{X}$

Estimate-Coherence($\mathbf{X}_1$, $r$)

    $\mathbf{U}_{X_1}$ ← SVD($\mathbf{X}_1$)                       ▷ keep left singular vectors
    $q$ ← min(rank($\mathbf{X}_1$), $r$)
    $\tilde{\mathbf{U}}$ ← Truncate($\mathbf{U}_{X_1}$, $q$)       ▷ keep top $q$ singular vectors of $\mathbf{X}_1$
    $\gamma(\mathbf{X}_1)$ ← Calculate-Gamma($\tilde{\mathbf{U}}$)  ▷ see equation (7)
    return $\gamma(\mathbf{X}_1)$

Figure 1: The proposed sampling-based algorithm to estimate matrix coherence.
Note that $r$ is only required when $\mathbf{X}$ is perturbed by noise.
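
The steps of Figure 1 translate almost directly into NumPy. The sketch below is one possible rendering, not the authors' implementation: the names `calculate_gamma` and `estimate_coherence`, the optional `r=None` default for the exactly low-rank case, and the relative tolerance `tol` used to decide the numerical rank are all assumptions of this example.

```python
import numpy as np

def calculate_gamma(U):
    """gamma(U) = max_i ||P_U e_i||^2, i.e. the largest squared row norm of U."""
    return float((U ** 2).sum(axis=1).max())

def estimate_coherence(X1, r=None, tol=1e-10):
    """Estimate gamma from the n x l matrix X1 of sampled columns."""
    U, s, _ = np.linalg.svd(X1, full_matrices=False)   # left singular vectors
    # Numerical rank of X1 (assumed rule: singular values above tol * largest).
    rank = int(np.sum(s > tol * s[0])) if s[0] > 0 else 0
    q = rank if r is None else min(rank, r)
    return calculate_gamma(U[:, :q])                   # top q singular vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 200))   # rank 5
cols = rng.choice(200, size=50, replace=False)
print(estimate_coherence(X[:, cols]))
```

Multiplying the returned value by $n/r$ gives the corresponding estimate of $\mu_0$, following equation (7).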



4 Theory 

In this section we present a theoretical analysis of Estimate-Coherence when
used with low-rank matrices. Our main theoretical results are presented in 
Theorem 1. 

Theorem 1. Define $\mathbf{X} \in \mathbb{R}^{n \times m}$ with
rank($\mathbf{X}$) $= r \ll n$, and denote by $\mathbf{U}_X$ the $r$ left singular
vectors of $\mathbf{X}$ corresponding to its non-zero singular values. Let the
orthogonal projection onto span($\mathbf{X}_1$) be denoted by
$\mathbf{P}_{X_1} = \mathbf{U}_{X_1} \mathbf{U}_{X_1}^\top$, and define the
projection onto its orthogonal complement as $\mathbf{P}_{X_1,\perp}$. Let
$\mathbf{X}_1$ be a set of $l$ columns of $\mathbf{X}$ sampled uniformly at random,
and let $\mathbf{x}$ be a column of $\mathbf{X}$ that is not in $\mathbf{X}_1$ that
is sampled uniformly at random. Then the following statements can be made about
$\gamma(\mathbf{X}_1)$, which is the output of Estimate-Coherence($\mathbf{X}_1$):

1. $\gamma(\mathbf{X}_1)$ is a monotonically increasing estimate of
$\gamma(\mathbf{X})$. Furthermore, if $\mathbf{X}_1' = [\mathbf{X}_1 \; \mathbf{x}]$
with $\mathbf{x}_\perp = \mathbf{P}_{X_1,\perp} \mathbf{x}$, then
$0 \le \gamma(\mathbf{X}_1') - \gamma(\mathbf{X}_1) \le \gamma(\mathbf{z})$,
where $\mathbf{z} = \mathbf{x}_\perp / \|\mathbf{x}_\perp\|$.

2. $\gamma(\mathbf{X}_1) = \gamma(\mathbf{X})$ when
rank($\mathbf{X}_1$) $=$ rank($\mathbf{X}$), and the probability of this event
is dependent on the coherence of $\mathbf{X}$. Specifically, for any $\delta > 0$,
it occurs with probability $1 - \delta$ for
$l \ge r^2 \mu_0(\mathbf{V}_X) \max(C_1 \log(r), C_2 \log(3/\delta))$ for
positive constants $C_1$ and $C_2$.
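
Both statements of Theorem 1 can be observed on a small example. The sketch below, assuming NumPy, a hypothetical helper `gamma_of_span`, and a random rank-$r$ test matrix, adds columns one at a time and records the resulting value of $\gamma$.

```python
import numpy as np

def gamma_of_span(A, tol=1e-10):
    """gamma for the column space of A: max_i e_i^T P_A e_i."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > tol * s.max()]              # orthonormal basis of span(A)
    return float((U ** 2).sum(axis=1).max())

rng = np.random.default_rng(0)
n, m, r = 100, 60, 8
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # rank r

# gamma of the first k columns, for k = 1..m
gammas = [gamma_of_span(X[:, :k]) for k in range(1, m + 1)]
print(gammas[0], gammas[r - 1], gammas[-1])
```

In this example the sequence never decreases, and it stops changing once the sampled columns reach rank $r$, at which point it equals $\gamma(\mathbf{X})$.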

The second statement in Theorem 1 leads to Corollary 1, which relates ma- 
trix coherence to the performance of the Column-sampling algorithm when used 
for matrix projection on a low-rank matrix. 

Corollary 1. Assume the same notation as defined in Theorem 1, and let
$\tilde{\mathbf{X}}^{col}$ be the matrix projection approximation generated by
the Column-sampling method using $\mathbf{X}_1$, as described in (2). Then, for
any $\delta > 0$, $\tilde{\mathbf{X}}^{col} = \mathbf{X}$ with probability
$1 - \delta$, for
$l \ge r^2 \mu_0(\mathbf{U}_X) \max(C_1 \log(r), C_2 \log(3/\delta))$ for positive
constants $C_1$ and $C_2$.

Proof. When rank(Xi) = rank(X), the columns of Xi span the columns of 
X. Hence, when this event occurs, projecting X onto the span of the columns 
of Xi leaves X unchanged. The second statement in Theorem 1 bounds the 
probability of this event. □ 

4.1 Proof of Theorem 1 

We first present Lemmas 1 and 2, and then complete the proof of Theorem 1 
using these lemmas. 

Lemma 1. Assume the same notation as defined in Theorem 1. Further, let
$\mathbf{P}_{X_1'}$ be the orthogonal projection onto span($\mathbf{X}_1'$) and
define $s = \|\mathbf{x}_\perp\|$. Then, for any $l \in [1, n - 1]$, the following
equalities relate the projection matrix $\mathbf{P}_{X_1'}$ to $\mathbf{P}_{X_1}$:

$$\mathbf{P}_{X_1'} = \begin{cases} \mathbf{P}_{X_1} + \mathbf{z}\mathbf{z}^\top & \text{if } s > 0; \\ \mathbf{P}_{X_1} & \text{if } s = 0. \end{cases} \qquad (8)$$

Proof. First assume that $s = 0$, which implies that $\mathbf{x}$ is in the span
of the columns of $\mathbf{X}_1$. Since orthogonal projections are unique, then
clearly $\mathbf{P}_{X_1'} = \mathbf{P}_{X_1}$ in this case. Next, assume that
$s > 0$, in which case the span of the columns of $\mathbf{X}_1'$ can be viewed as
the subspace spanned by the columns of $\mathbf{X}_1$ along with the subspace
spanned by the residual of $\mathbf{x}$, i.e., $\mathbf{x}_\perp$. Observe that
$\mathbf{z}\mathbf{z}^\top$ is the orthogonal projection onto
span($\mathbf{x}_\perp$). Since these two subspaces are orthogonal and since
orthogonal projection matrices are unique, we can write $\mathbf{P}_{X_1'}$ as
the sum of orthogonal projections onto these subspaces, which matches the
statement of the lemma for $s > 0$. □
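
The rank-one update in Lemma 1, and the bound it implies on the change in $\gamma$, can be verified numerically. This is a small sketch assuming NumPy; the helper `projector` and the test dimensions are illustrative choices, not part of the paper.

```python
import numpy as np

def projector(A, tol=1e-10):
    """Orthogonal projection matrix onto the column space of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    U = U[:, s > tol * s.max()]
    return U @ U.T

rng = np.random.default_rng(1)
n, r, l = 50, 6, 3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, 20))
X1, x = X[:, :l], X[:, l]                    # l sampled columns and one new column

P1 = projector(X1)
x_perp = x - P1 @ x                          # residual of x outside span(X1)
z = x_perp / np.linalg.norm(x_perp)          # here s = ||x_perp|| > 0
P1_new = projector(np.column_stack([X1, x]))

# gamma(X1') - gamma(X1), using gamma = max diagonal entry of the projector
gap = np.diag(P1_new).max() - np.diag(P1).max()
print(np.allclose(P1_new, P1 + np.outer(z, z)), gap, (z ** 2).max())
```

Here `(z ** 2).max()` is $\gamma(\mathbf{z})$, the upper bound on the gap that appears in the first statement of Theorem 1.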



Lemma 2. Assume the same notation as defined in Theorem 1. Then, for any
$\delta > 0$, if
$l \ge r^2 \mu_0(\mathbf{U}_X) \max(C_1 \log(r), C_2 \log(3/\delta))$, where $C_1$
and $C_2$ are positive constants, rank($\mathbf{X}_1$) $= r$ with probability at
least $1 - \delta$.

Proof. Assuming uniform sampling at random, Talwalkar and Rostamizadeh
(2010) show that
$\Pr[\text{rank}(\mathbf{X}_1) = r] \ge \Pr\left(\|c \mathbf{V}_{X,l}^\top \mathbf{V}_{X,l} - \mathbf{I}\|_2 < 1\right)$
for any $c > 0$, where $\mathbf{V}_{X,l} \in \mathbb{R}^{l \times r}$ corresponds to
the first $l$ components of the $r$ right singular vectors of $\mathbf{X}$.
Applying Theorem 1.2 in Candès and Romberg (2007) and using the identity
$r \mu_0 \ge \mu^2$ yields the statement of the lemma. □

Now, to prove Theorem 1 we analyze the difference:

$$\Delta_l = \left| \gamma(\mathbf{X}_1') - \gamma(\mathbf{X}_1) \right| = \left| \max_j \mathbf{e}_j^\top \mathbf{P}_{X_1'} \mathbf{e}_j - \max_i \mathbf{e}_i^\top \mathbf{P}_{X_1} \mathbf{e}_i \right|. \qquad (9)$$

If $s = \|\mathbf{x}_\perp\| = 0$, then by Lemma 1, $\Delta_l = 0$. If $s > 0$, then
using Lemma 1 and (9) yields:

$$\Delta_l = \max_j \mathbf{e}_j^\top (\mathbf{P}_{X_1} + \mathbf{z}\mathbf{z}^\top) \mathbf{e}_j - \max_i \mathbf{e}_i^\top \mathbf{P}_{X_1} \mathbf{e}_i \qquad (10)$$

$$\le \max_j \mathbf{e}_j^\top \mathbf{z}\mathbf{z}^\top \mathbf{e}_j = \gamma(\mathbf{z}). \qquad (11)$$

In (10), we use the fact that orthogonal projections are always SPSD, which
means that $\mathbf{e}_j^\top \mathbf{z}\mathbf{z}^\top \mathbf{e}_j \ge 0$ for all
$j$ and ensures that $\Delta_l \ge 0$. In (11) we decouple the max($\cdot$) over
$\mathbf{P}_{X_1}$ and $\mathbf{z}\mathbf{z}^\top$ to obtain the inequality and then
apply the definition of $\gamma(\cdot)$, which yields the first statement of
Theorem 1. Finally, the second statement of Theorem 1 follows directly from
Lemma 1 when $s = 0$ along with Lemma 2, as the former shows that $\Delta_l = 0$
if rank($\mathbf{X}_1$) $=$ rank($\mathbf{X}$) and the latter gives a
coherence-based finite-sample bound on the probability of this event occurring.

5 Experiments

Theorem 1 suggests that the ability to estimate matrix coherence is dependent
on the coherence of the matrix itself. In fact, if we adversarially construct a
high coherence matrix and select columns from this matrix in an unfortunate
manner, the results are quite discouraging. For instance, imagine that we
generate a random SPSD matrix, e.g., using the Rand function in Matlab, and then
replace its first diagonal entry with an arbitrarily large value, leading to a
very high coherence matrix. If we subsequently force our sampling mechanism to
ignore the first column of this matrix, we are completely unable to estimate
coherence using Estimate-Coherence, as illustrated in Figure 2 on a synthetic
matrix generated in Matlab following this procedure, with n = 1000 and k = 50.
[Figure 2 plot, titled 'Gamma Estimation (Worst Case)']

Figure 2: Synthetic dataset illustrating worst-case performance of
Estimate-Coherence.
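
The adversarial construction described above can be imitated on a smaller scale. The sketch below is an assumption-laden stand-in for the Matlab experiment: it builds the random SPSD matrix as $\mathbf{A}\mathbf{A}^\top$ with uniform random $\mathbf{A}$, uses $n = 300$ and $k = 10$ instead of the sizes in the text, sets the first diagonal entry to `1e6`, and never samples the first column. The helper `gamma_of_span` is hypothetical.

```python
import numpy as np

def gamma_of_span(A, tol=1e-10):
    """gamma computed from the left singular vectors spanning the columns of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    q = int(np.sum(s > tol * s.max()))       # numerical rank
    return float((U[:, :q] ** 2).sum(axis=1).max())

rng = np.random.default_rng(0)
n, k = 300, 10
A = rng.random((n, k))
K = A @ A.T                                  # random rank-k SPSD matrix
K[0, 0] = 1e6                                # arbitrarily large first diagonal entry

true_gamma = gamma_of_span(K)
cols = 1 + rng.choice(n - 1, size=100, replace=False)   # first column is never sampled
est_gamma = gamma_of_span(K[:, cols])
print(true_gamma, est_gamma)
```

Because only the first column carries the large entry, the sampled columns span none of the direction that makes the matrix coherent, so the estimate stays far below the true value however many of the remaining columns are used.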

In spite of these discouraging worst-case results, our extensive empirical 
studies show that Estimate-Coherence performs quite well in practice on a 
variety of synthetic and real datasets with varying coherence, suggesting that 
the worst case addressed in theory and matched empirically in Figure 2 is rarely 
encountered in practice. We present these results in the remainder of this sec- 
tion. 

5.1 Experiments with synthetic data 

[Figure 3 plots; recovered panel titles: 'Exact Gamma of Synthetic Datasets',
'Gamma Estimation Error, Decay = SLOW'; legend entries: 'Exact', 'Approx']

Figure 3: Experiments with synthetic matrices. (a) True coherence associated
with 'low', 'mid' and 'high' coherences. (b-d) Exact low-rank experiments
measuring difference between the exact coherence and the estimate by
Estimate-Coherence. (e-f) Experiments with low-rank matrices in the presence of
noise, comparing exact and estimated coherence with two different levels of noise.

We first generated low-rank synthetic matrices with varying coherence and
singular value spectra, with n = m = 1000, and r = 50. To control the low-rank
structure of the matrix, we generated datasets with exponentially decaying
singular values with differing decay rates, i.e., for $i \in \{1, \ldots, r\}$ we
defined the $i$th singular value as $\sigma_i = \exp(-i\eta)$, where $\eta$
controls the rate of decay and $\eta_{slow} = .01$, $\eta_{medium} = .1$,
$\eta_{fast} = .5$. To control coherence, we independently generated left and
right singular vectors with varying coherences by manually defining one singular
vector and then using QR to generate $r - 1$ additional orthogonal vectors. We
associated this coherence-inducing singular vector with the $r/2$ largest
singular value. We defined our 'low' coherence model by forcing the
coherence-inducing singular vector to have minimal coherence, i.e., setting
each component equal to $1/\sqrt{n}$. Using this as a baseline, we used 3 and 8
times this baseline to generate 'mid' and 'high' coherences (see Figure 3(a)).
We then used Estimate-Coherence with varying numbers of sampled columns to
estimate matrix coherence. Results reported in Figure 3(b-d) are means and
standard deviations of 10 trials for each value of $l$. Although the coherence
estimate converges faster for the low coherence matrices, the results show that
even in the high coherence matrices, Estimate-Coherence recovers the true
coherence after sampling only $r$ columns. Further, we note that the singular
value spectrum influences the quality of the estimate. This observation is due
to the fact that the faster the singular values decay, the greater the impact of
the $r/2$ largest singular value, which is associated with the
coherence-inducing singular vector, and hence the more likely it will be
captured by sampled columns.
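
One way to realize the synthetic construction is sketched below with NumPy. The text does not spell out how the coherence-inducing vector is scaled, so this example makes an explicit assumption: a single entry is set to `scale` times $1/\sqrt{n}$ and the remaining entries are equal, chosen so the vector has unit norm. The function names and the column-swap used to attach the vector to the $(r/2)$th singular value are likewise illustrative.

```python
import numpy as np

def coherent_basis(n, r, scale, rng):
    """Orthonormal n x r basis whose first column has one entry scale / sqrt(n)."""
    v = np.empty(n)
    v[0] = scale / np.sqrt(n)                         # assumed coherence-inducing entry
    v[1:] = np.sqrt((1.0 - v[0] ** 2) / (n - 1))      # remaining mass spread evenly
    M = np.column_stack([v, rng.standard_normal((n, r - 1))])
    Q, R = np.linalg.qr(M)                            # QR supplies r - 1 more vectors
    return Q * np.sign(np.diag(R))                    # fix signs so Q[:, 0] equals v

def synthetic_matrix(n, m, r, eta, scale, rng):
    """Rank-r matrix with singular values exp(-i * eta), i = 1..r."""
    U = coherent_basis(n, r, scale, rng)
    V = coherent_basis(m, r, scale, rng)
    sigma = np.exp(-eta * np.arange(1, r + 1))
    j = r // 2 - 1                                    # position of the (r/2)th singular value
    U[:, [0, j]] = U[:, [j, 0]]                       # attach the coherent vector to it
    V[:, [0, j]] = V[:, [j, 0]]
    return (U * sigma) @ V.T

rng = np.random.default_rng(0)
X = synthetic_matrix(n=100, m=80, r=6, eta=0.1, scale=3.0, rng=rng)
```

With `scale=1.0` every entry of the defining vector equals $1/\sqrt{n}$, which corresponds to the 'low' coherence baseline described above.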

Dataset     Type of data              # Points (n)   # Features (d)   Kernel
NIPS        bag of words              1500           12419            linear
PIE         face images               2731           2304             linear
MNIS        digit images              4000           784              linear
Essential   proteins                  4728           16               RBF
Abalone     abalones                  4177           8                RBF
Dexter      bag of words              2000           20000            linear
KIN-8nm     kinematics of robot arm   2000           8                polynomial

Table 1: Description of real datasets used in our coherence experiments,
including the type of data, the number of points (n), the number of features (d)
and the choice of kernel (Asuncion and Newman, 2007; Gustafson et al., 2006;
LeCun and Cortes, 1998; Sim et al., 2002).

Next, we examined the scenario of low-rank matrices with noise, working
with the 'MEDIUM' decaying matrices used in the low-rank experiments. To
create a noisy matrix from each original low-rank matrix, we first used the QR
algorithm to find a full orthogonal basis containing the $r$ left singular vectors
of the original matrix, and used it as our new left singular vectors (we repeated
this procedure to obtain right singular vectors). We then defined each of the
remaining $n - r$ singular values of our noisy matrix to equal some fraction of
the $r$th singular value of the original matrix (0.1 for 'SMALL' noise and 0.9
for 'LARGE' noise). The performance of Estimate-Coherence on these noisy
matrices is presented in Figure 3(e-f), where results are means and standard
deviations of 10 trials for each value of $l$. The presence of noise clearly has a
negative effect on performance, yet the estimates are quite accurate for $l = 2r$
in the 'SMALL' noise scenario, and even for the high coherence matrices with
'LARGE' noise, the estimate is fairly accurate when $l \ge 4r$.
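
The noise construction can be sketched as follows, assuming NumPy and a square input matrix ($n = m$, as in the synthetic experiments). The helper names `complete_basis` and `add_noise` are hypothetical, and completing the basis by running QR on the known singular vectors padded with Gaussian columns is one reasonable reading of the procedure rather than a confirmed detail.

```python
import numpy as np

def complete_basis(U, rng):
    """Extend an orthonormal n x r matrix U to an n x n orthogonal matrix via QR."""
    n, r = U.shape
    Q, R = np.linalg.qr(np.column_stack([U, rng.standard_normal((n, n - r))]))
    return Q * np.sign(np.diag(R))           # sign fix keeps the first r columns equal to U

def add_noise(X, r, fraction, rng):
    """Fill the trailing n - r singular values with fraction * sigma_r (square X assumed)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_full = complete_basis(U[:, :r], rng)
    V_full = complete_basis(Vt[:r].T, rng)
    s_full = np.full(X.shape[0], fraction * s[r - 1])
    s_full[:r] = s[:r]                       # original spectrum is left untouched
    return (U_full * s_full) @ V_full.T

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 4)) @ rng.standard_normal((4, 40))   # rank 4
X_small = add_noise(X, r=4, fraction=0.1, rng=rng)                # 'SMALL' noise level
X_large = add_noise(X, r=4, fraction=0.9, rng=rng)                # 'LARGE' noise level
```

With `fraction=0.9` the trailing singular values sit just below the $r$th one, which is why the low-rank parameter $r$ matters in this regime.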

5.2 Experiments with real data 

We next performed experiments using the datasets listed in Table 1. We used a 
variety of kernel functions to generate SPSD kernel matrices from these datasets, 
with the resulting kernel matrices being quite varied in coherence (see Figure 
4(a)). We then used Estimate-Coherence with $r$ set to equal the number
of singular values needed to capture 99% of the spectral energy of each kernel
matrix. Figure 4(b) shows the estimation error over 10 trials. Although the
coherence is well estimated across datasets when $l > 100$, the estimates for the two
high coherence datasets (nips and dext) converge most slowly and exhibit the
most variance across trials. Next, we performed spectral reconstruction using
the Nyström method and matrix projection reconstruction using the
Column-sampling method, and report results over 10 trials in Figure 4(c-d). The results
clearly illustrate the connection between matrix coherence and the quality of 
these low-rank approximation techniques, as the two high coherence datasets 
exhibit significantly worse performance than the remaining datasets. 
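
Two quantities used in these experiments are easy to state in code: the rank needed to capture a given share of spectral energy, and the normalized Frobenius error of an approximation. The sketch below assumes NumPy and, because the text does not define "spectral energy", takes it to mean the sum of squared singular values; both function names are illustrative.

```python
import numpy as np

def rank_for_energy(K, energy=0.99):
    """Smallest r whose top-r squared singular values reach the given energy share."""
    s = np.linalg.svd(K, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

def normalized_error(K, K_approx):
    """||K - K_approx||_F / ||K||_F."""
    return np.linalg.norm(K - K_approx, "fro") / np.linalg.norm(K, "fro")

K = np.diag([3.0, 2.0, 1.0, 0.1])
print(rank_for_energy(K))                       # number of singular values kept
print(normalized_error(K, np.zeros_like(K)))    # error of the trivial approximation
```

If spectral energy were instead defined through the singular values themselves, only the `s ** 2` terms in `rank_for_energy` would change.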
Figure 4: Experiments with real data. (a) True coherence of each kernel matrix
$\mathbf{K}$. (b) Difference between the true coherence and the estimated
coherence. (c-d) Quality of two types of low-rank matrix approximations
($\tilde{\mathbf{K}}$), where 'Normalized Error' equals
$\|\mathbf{K} - \tilde{\mathbf{K}}\|_F / \|\mathbf{K}\|_F$.

6 Conclusion 

We proposed a novel algorithm to estimate matrix coherence. Our theoretical 
analysis shows that Estimate-Coherence provides good estimates for rela- 
tively low-coherence matrices, and more generally, its effectiveness is tied to 
coherence itself. We corroborate this finding for high-coherence matrices with 
an adversarially chosen dataset and sampling scheme. Empirically, however,
our algorithm efficiently and accurately estimates coherence across a wide range 
of datasets, and these estimates are excellent predictors of the effectiveness of 
sampling-based matrix approximation. We believe that our algorithm should be 
used whenever low-rank matrix approximation is being considered to determine 
its applicability on a case-by-case basis. Moreover, the variance of coherence es- 
timates across multiple samples may provide further information, and the use of 
multiple samples fits nicely in the framework of ensemble methods for low-rank 
approximation, e.g., Kumar et al. (2009a). 